Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework Initializer container and handling its verdict #450

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

frittentheke
Copy link
Contributor

@frittentheke frittentheke commented Aug 22, 2024

Use exit codes (of k6 archive + inspect) as sole indication for initialization success

  1. Introduce an EmptyDir volume for temporary data
    This removes the need for any mkdir and avoids writing to the container filesystem,
    allowing for e.g. readOnlyRootFilesystem security contexts ([1]).

  2. Remove all output redirection and grepping for error indicating strings in favor of
    using the exit codes from the commands to indicate success or failure reestablishing the
    clear contract of the inspect command to judge the provided config.

This approach is much cleaner in regards to any changes to the output of K6 and the vast
space of error there is.

k6 inspect --execution-requirements ([2,3]) while also printing warnings or the JSON
should clearly indicate via err if there are issues with the provided tests, which
are then converted to a non-zero RC ([4]).

  1. Set TerminationMessagePolicy on Initializer to use logs as fallback

While not explicitly used, with the removal of the log fetching from the
Initializer (Job->Pod->Container) any potential source for verbose information
about the error might be lost. The Kubernetes approach to provide more details
than the exit code about the reason a container terminated is the termination
message (see [5]).

Per default this message needs to be explictly written to /dev/termination-log,
but there also is the option to use the last bit of log output as fallback source.

  1. Remove fetching of container logs from initializer

Following the focus on the exit code of the command using in the initializer job (Pod),
this commit removes the fetching of the container log.

There was only some basic JSON unmarshalling applied with no interpretation of what it contained.
This is either covered by k6 inspect exiting with rc=0 or should be added to the initializer
job.

If further details about the failure reason of the initalizer container was to be required,
the source should be the termination message. It could be used to enrich the TestRun CR to provide
a higher level of detail about the failure to the user.

[1] https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
[2] https://github.com/grafana/k6/blob/c212bd277aeedbf8395b917df11bd9be4da490a4/cmd/inspect.go#L34
[3] https://github.com/grafana/k6/blob/c212bd277aeedbf8395b917df11bd9be4da490a4/cmd/inspect.go#L64
[4] https://github.com/grafana/k6/blob/master/cmd/root.go#L90-L118
[5] https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/

Fixes: #435

…alization success

1) Introduce an EmptyDir volume for temporary data
This removes the need for any mkdir and avoids writing to the container filesystem,
allowing for e.g. `readOnlyRootFilesystem` security contexts ([1]).

2) Remove all output redirection and grepping for error indicating strings in favor of
using the exit codes from the commands to indicate success or failure.
This approach is much cleaner in regards to any changes to the output of K6 and the vast
space of error there is.

It also reestablishes the clear contract of the `inspect` command to judge the provided
config.

`k6 inspect --execution-requirements` ([2,3]) while also printing warnings or the JSON
should clearly indicate via `err` if there are issues with the provided tests, which
are then converted to a non-zero RC ([4]).

[1] https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
[2] https://github.com/grafana/k6/blob/c212bd277aeedbf8395b917df11bd9be4da490a4/cmd/inspect.go#L34
[3] https://github.com/grafana/k6/blob/c212bd277aeedbf8395b917df11bd9be4da490a4/cmd/inspect.go#L64
[4] https://github.com/grafana/k6/blob/master/cmd/root.go#L90-L118

Fixes: grafana#435
While not explicitly used, with the removal of the log fetching from the
Initializer (Job->Pod->Container) any potential source for verbose information
about the error might be lost. The Kubernetes approach to provide more details
than the exit code about the reason a container terminated is the termination
message (see [1]).

Per default this message needs to be explictly written to `/dev/termination-log`,
but there also is the option to use the last bit of log output as fallback source.

[1] https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/
Following the focus on the exit code of the command using in the initializer job (Pod),
this commit removes the fetching of the container log.

There was only a basic JSON unmarshalling applied with no interpretion of what it contained.
This is either covered by `k6 inspect` exiting with rc=0 or should be added to the initializer
job.

If further details about the failure reason of the initalizer container was to be required,
the source should be the termination message. It could be used to enrich the TestRun CR to provide
a higher level of detail about the failure to the user.

[1] https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/#customizing-the-termination-message
@yorugac
Copy link
Collaborator

yorugac commented Nov 5, 2024

Hi @frittentheke, thanks for the suggestions. I'm afraid I don't see how to make these work right now 😔

Let's go point-by-point:

  1. EmptyDir: I'm yet to hear anyone bringing up readOnlyRootFilesystem as a requirement (unless you meant this requirement is present in your setup?). Nevertheless, it's a valid point. But by your PR, it is unclear what part is supposed to read that volume. For example, right now, the logs are being read by the app directly, but I don't think there's a guarantee that EmptyDir of another pod (initializer) would continue to exist long enough nor that the app and initializer will be scheduled on the same node in the first place.

    Going by the rest of the description, esp. point 4, I assume you suggest to actually remove parsing the log completely. That cannot be done ATM, at all. Why we're reading the logs is mentioned in this comment.

  2. Remove all output redirection and grepping for error indicating strings in favor of
    using the exit codes from the commands to indicate success or failure reestablishing the
    clear contract of the inspect command to judge the provided config.

    Sadly, k6 is not consistent enough around logging and exit codes both to risk such big adjustments here. If you're curious, you can see issues like this or this. The current logic of a complex command in initializer did not appear out of thin air, so to say: it is meant to cover all cases possible.

    Also, I suppose your PR 453 is kind of a continuation of the same idea: there is a more specific discussion over there.

  3. TerminationMessagePolicy on initializer: sounds interesting, but also a bit redundant, given our current reading of initializer logs.

  4. Remove fetching of container logs from initializer

    Already covered above.

@frittentheke
Copy link
Contributor Author

frittentheke commented Dec 11, 2024

Sorry @yorugac for not getting back to you on your extensive feedback.

Hi @frittentheke, thanks for the suggestions. I'm afraid I don't see how to make these work right now 😔

Let's go point-by-point:

1. `EmptyDir`: I'm yet to hear anyone bringing up `readOnlyRootFilesystem` as a requirement (unless you meant this requirement is present in your setup?). Nevertheless, it's a valid point. But by your PR, it is unclear what part is supposed to read that volume. For example, right now, the logs are being read by the app directly, but I don't think there's a guarantee that `EmptyDir` of another pod (initializer) would continue to exist long enough nor that the app and initializer will be scheduled on the same node in the first place.

You are reading too much into the EmptyDir change. This is simply following good practice when providing the container application some temporary space (and a folder, which had to be mkdired before) to write some output into. The data in there is not required after the Job / Pod is done. Also such an EmptyDir is tied to the lifecycle of a Pod an cannot possibly be reused. See https://kubernetes.io/docs/concepts/storage/volumes/#emptydir . So in essence this is NOT about some shared storage or data transfer in the slightest. But seeing mkdir and writes to the container filesystem got me triggered.

   Going by the rest of the description, esp. point 4, I assume you suggest to actually remove parsing the log completely. That cannot be done ATM, at all. Why we're reading the logs is mentioned in [this comment](https://github.com/grafana/k6-operator/issues/193#issuecomment-2252956906).

2. > Remove all output redirection and grepping for error indicating strings in favor of
   > using the exit codes from the commands to indicate success or failure reestablishing the
   > clear contract of the inspect command to judge the provided config.
   
   
   Sadly, k6 is not consistent enough around logging and exit codes both to risk such big adjustments here. If you're curious, you can see issues like [this](https://github.com/grafana/k6/issues/2804) or [this](https://github.com/grafana/k6-docs/issues/877). The current logic of a complex command in initializer did not appear out of thin air, so to say: it is meant to cover all cases possible.

I understand that the exit code is not sufficient atm. But since K6 is also a project by Grafana, there should be no reason not to fix this at the root. And if some grouping or mapping of exit codes is required one can always write up an entrypoint script handling the various exit codes and mapping them to either 0 or some other value.

   Also, I suppose your [PR 453](https://github.com/grafana/k6-operator/pull/453) is kind of a continuation of the same idea: there is a more specific discussion over there.

Yes. More of a first step with as little code change as possible.

3. `TerminationMessagePolicy` on initializer: sounds interesting, but also a bit redundant, given our current reading of initializer logs.

Redundant only if you do both. My idea would be to leave the logs verbose and for the user and replace the log fetching with a fetch of the TerminationMessage. So to combine 2 and 3: If the exit code is zero the contract between initalizer and k6-operator would be that the TerminationMessage contains the required JSON. On any other issue (non-zero exit code), the pod log of the initializer contains all of the validation and linting, testing and whatnot output to help the user find and fix the issue (thus 4, no fetching the log and expecting it to only contain the required JSON.

4. > Remove fetching of container logs from initializer
   
   
   Already covered above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issues within initalizer error handling if script is incorrect
2 participants